Node-adaptive graph Transformer with structural encoding for accurate and robust lncRNA-disease association prediction

Background Long noncoding RNAs (lncRNAs) are integral to many critical cellular processes, including the regulation of gene expression, cell differentiation, and the development of tumors and cancers. Predicting the relationships between lncRNAs and diseases can contribute to a better understanding of the pathogenic mechanisms of disease and provide strong support for the development of advanced treatment methods. Results We present an innovative Node-Adaptive Graph Transformer model for predicting unknown LncRNA-Disease Associations, named NAGTLDA. First, we utilize the node-adaptive feature smoothing (NAFS) method to learn the local feature information of nodes and encode the structural information of the fusion similarity networks of diseases and lncRNAs using Structural Deep Network Embedding (SDNE). Next, a Transformer module is used to capture potential association information between the network nodes. Finally, we employ a Transformer module with two multi-head attention layers to learn global-level embedding fusion. Network structure encoding is added as the structural inductive bias of the network to compensate for the missing message-passing mechanism in the Transformer. NAGTLDA achieved an average AUC of 0.9531 and an average AUPR of 0.9537 in 5-fold cross-validation, significantly higher than state-of-the-art methods. We performed case studies on 4 diseases; 55 out of 60 associations between lncRNAs and diseases have been validated in the literature. The results demonstrate the enormous potential of the graph Transformer structure to incorporate graph structural information for uncovering unknown lncRNA-disease associations. Conclusions Our proposed NAGTLDA model can serve as a highly efficient computational method for predicting biological information associations.


Background
According to a large number of cell biology experiments, lncRNAs are RNA molecules that are not involved in protein coding and exceed approximately 200 nucleotides in length [1][2][3][4]. Initially, most researchers regarded lncRNAs as an unimportant by-product of transcription. However, as experimental results have accumulated, researchers have found that lncRNAs play important roles in many cellular processes. They are involved in regulating the cell cycle, embryonic development, the spatial and temporal control of gene expression, and the determination of cell fate [5]. Moreover, ongoing clinical studies of human diseases have shown that lncRNAs are closely linked to many human cancers [6,7] and play a decisive role in human cardiovascular physiology and pathology [8]. Researchers therefore regard lncRNAs as a crucial factor in the study of human diseases and have made the relationships between diseases and lncRNAs a new research direction. Exploring these relationships deepens our understanding of disease mechanisms [9] and helps identify the causes of diseases at the genetic level. Understanding the interactions between lncRNAs and diseases also allows us to intervene in and regulate the expression of disease-related genes and to find new targets and strategies [10] for treatment. Because the expression levels of some lncRNAs are very prominent in certain diseases, lncRNAs can serve as potential biomarkers and play an important role in the early detection and treatment of diseases. In drug discovery, exploring the relationship between diseases and lncRNAs can help in developing new and more effective drugs. In addition, human genetic diseases [11] are closely associated with lncRNAs. Investigating lncRNAs helps elucidate certain genetic diseases that stem from gene mutations, thereby expediting research into genetic disorders.
However, studying these links in real clinical experiments requires considerable time and significant material resources and is challenging to apply on a large scale. Therefore, designing a novel computational model to predict associations between diseases and lncRNAs is of great importance for advancing bioinformatics. In practice there are some challenges: (1) Large datasets exhibit a low percentage of positive samples, and the resulting sparsity reduces a model's ability to predict positive samples effectively. (2) The available disease and lncRNA association data are limited and lack a cohesive fusion of biological association data, and similarity calculations rely heavily on association matrices.
Many methods for calculating lncRNA-disease associations have been developed, and their accuracy and reliability have been verified by biological experiments. To support better computational methods, researchers have collected a large quantity of data to create relevant benchmark databases. Gene Reference Into Function (GeneRIF) [12], DisGeNET [13], and Disease Ontology (DO) [14] are three standard databases related to diseases. RNADisease v4.0 [15], Lnc2Cancer [16], and LncRNADisease [17] are three standard databases related to lncRNA-disease associations. These databases also make it possible to move beyond the earlier view that one lncRNA corresponds to one disease and to run global calculations and experiments with a proposed computational method on the benchmark datasets they provide.
Numerous computational techniques for exploring disease-lncRNA interactions have emerged with the continual advancement of technology. The available computational methods can be classified into bioinformatics network-based methods [18] and deep learning-based methods [19].
Bioinformatics network-based models combine known associations and similarities into heterogeneous networks and use various message-propagation mechanisms and random walks to compute potential associations on the constructed heterogeneous network. For example, the KRWRH model [20] integrated disease similarities, lncRNA similarities, and known associations into a new heterogeneous network and used random walk with restart to compute associations between lncRNAs and diseases. The RWRHLD model [21] combined three networks into a heterogeneous network: the observed lncRNA-disease association network, the lncRNA-lncRNA crosstalk network, and the disease similarity network; links between diseases and lncRNAs are then inferred using random walk with restart. The IRWRLDA model [22] improves upon traditional random walks by considering both lncRNA similarity and disease similarity when setting the initial probabilities. It can infer new associations even when a disease has no known association with any lncRNA. The SIMCLDA model [23] applied matrix completion and principal component analysis to infer potential associations. The NCPLDA model [24] used network consistency projection to calculate new associations between lncRNAs and diseases. The GrwLDA model [25] generated a global network by combining known lncRNA-disease interaction information, disease fusion similarity, and lncRNA fusion similarity and utilized this network to explore novel associations between diseases and lncRNAs. The LRWRHLDA model [26] integrated multiple heterogeneous and homogeneous networks to construct a three-layer bioinformatics network and used random walk with restart (RWR) to mine interactions. The LRWHLDA model [27] mines relationships between diseases and lncRNAs with a localized random walk that takes full advantage of the network topology. The LncRDNetFlow model [28] integrated three interaction networks (a disease interaction network, a lncRNA interaction network, and a protein interaction network) to construct a three-layer heterogeneous network from which disease and lncRNA feature data are obtained. Nevertheless, none of these methods can comprehensively learn and fuse local and global information, nor can they perform deeper network feature learning.
Deep learning-based lncRNA-disease association prediction models have shown significant performance improvements over earlier shallow models. The CNNLDA model [29] reorganized multiple sources of similarity and introduced miRNA datasets to enable the neural network to learn more information. It used convolutional neural networks to learn node embeddings and inferred associations between diseases and lncRNAs. The BiGAN model [30] combined the similarities of lncRNAs and diseases and adopted a bidirectional generative adversarial network to infer their associations. The MCA-Net model [31] used embedding learning for multiple feature sources, ensuring that each node has a unique vector representation, and used attention-based convolutional neural networks to mine direct interactions between lncRNAs and diseases. The ACLDA model [32] constructed a metapath-based network using lncRNAs, miRNAs, and diseases and combined CNNs and autoencoders for association prediction. The VADLP model [33] constructed multilayer graphs to integrate multiple similarities and employed variance autoencoders and CNNs for lncRNA-disease interaction inference. The gGATLDA model [34] applied attention mechanisms at the graph level: during graph construction, each disease-lncRNA pair is extracted to form a subgraph for lncRNA-disease relationship calculation. The MLMKDNN model [35] proposed a deep multi-kernel learning method comprising feature matrix construction, kernel space mapping, and deep neural network fusion; kernel space mapping transforms the feature matrix so that deep neural networks can integrate it effectively. The MLGCNET model [36] employed a multilayer graph autoencoder to obtain representation vectors of diseases and lncRNAs. The MGATE model [37] applied a multi-channel self-attentive encoder to learn latent embeddings of diseases and lncRNAs from multiple views of the graph. The GANLDA model [38] incorporated multi-source data as initial features, adopted GAT to obtain feature information about nodes and their neighbors, and finally used a multilayer perceptron to score the associations. However, when deep graph neural networks are built, node learning tends to suffer from over-smoothing, resulting in minimal differences between the vector representations of nodes.
A new trend is to combine Transformers and graph neural networks to process graph data. This approach combines the parallelizability of Transformers and the advantages of their multi-head attention mechanism with graph neural network methods to design new neural network models for graph data. Microsoft introduced Graphormer [39], which, for the first time, utilized Transformers for graph-level tasks. It integrated centrality encoding, spatial encoding, and edge encoding into the Transformer, thereby incorporating graph structural information, and showed improved performance on widely used benchmark datasets for graph representation learning. Following this trend, GraphGPS emerged as a general framework that combines graph neural networks and Transformers [40]. It uses an MLP to learn graph information and feeds it into both the graph neural network and the Transformer for graph representation learning; fusing the results of both models leads to highly competitive outcomes.
Although these methods have achieved relatively good results in lncRNA-disease association prediction, they still have the following limitations: (1) Graph-based methods do not maintain good performance and robustness on large, sparse datasets, and over-smoothing of node features can occur [41]. Their learning ability is limited when confronted with complex heterogeneous graphs comprising different types of nodes and edges [42,43]. (2) Traditional deep learning-based and bioinformatics network-based approaches do not capture both local and global information and do not learn node features by fusing the information encoded in the graph structure. (3) Existing methods also use simple linear fusion to combine features [23,24,26,38]. An adaptive and efficient fusion approach has the potential to significantly improve model performance and robustness.
Based on the limitations of existing methods and the inherent advantages of the Transformer model, we propose an innovative lncRNA-disease association prediction model named NAGTLDA. First, we construct a heterogeneous network from observed associations and compute the integrated similarity of diseases and lncRNAs to create their respective integrated similarity networks. Next, we employ node-adaptive feature smoothing (NAFS) [44] to perform local-level node embedding on the heterogeneous network and the integrated similarity networks. Simultaneously, we utilize Structural Deep Network Embedding (SDNE) [45] to encode the structural information of the integrated similarity networks. Furthermore, we utilize a Transformer model for global-level embedding learning, leveraging its inherent global perspective to uncover potential association information. Finally, we employ a Transformer model to perform global-level fusion of all learned embeddings and incorporate the structural inductive bias of the network. This fusion makes effective use of all captured information and improves the performance of inferring associations between diseases and lncRNAs. Our proposed model outperforms existing models in terms of performance and scalability.
In summary, our research makes the following key contributions: • We employ the NAFS method for feature embedding learning without the need for explicit training, and we utilize SDNE to encode the network structure.

Known human lncRNA-disease associations
In our experiment, we used a benchmark dataset to assess the effectiveness of our model. This dataset was obtained from previous research by Fu et al. [46] on lncRNA-disease association prediction and includes 240 lncRNAs, 412 diseases, and 2697 experimentally validated lncRNA-disease interactions from the Lnc2Cancer [16], LncRNADisease [17], and GeneRIF [47] databases. We denote the numbers of lncRNAs and diseases as $N_l$ and $N_d$, respectively. We constructed an adjacency matrix $A \in \mathbb{R}^{N_l \times N_d}$ based on the observed interactions between lncRNAs and diseases, where $A(l(i), d(j)) = 1$ if there is an identified relationship between lncRNA $l(i)$ and disease $d(j)$; otherwise $A(l(i), d(j)) = 0$.
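As a minimal sketch of how such an adjacency matrix can be assembled from a list of validated pairs, the following uses plain Python lists. The function name `build_association_matrix` and the identifiers `lnc_a`, `dis_x`, and so on are hypothetical placeholders, not entries from the benchmark dataset.

```python
def build_association_matrix(lncrnas, diseases, known_pairs):
    """Return the N_l x N_d matrix A with A[i][j] = 1 for each known pair."""
    row = {name: i for i, name in enumerate(lncrnas)}
    col = {name: j for j, name in enumerate(diseases)}
    A = [[0] * len(diseases) for _ in lncrnas]
    for lncrna, disease in known_pairs:
        A[row[lncrna]][col[disease]] = 1
    return A


# Placeholder identifiers for illustration only.
lncrnas = ["lnc_a", "lnc_b", "lnc_c"]
diseases = ["dis_x", "dis_y"]
known_pairs = [("lnc_a", "dis_x"), ("lnc_b", "dis_y"), ("lnc_c", "dis_x")]
A = build_association_matrix(lncrnas, diseases, known_pairs)
```

With the dataset sizes quoted above, 240 × 412 = 98,880 entries hold 2697 ones, so roughly 2.7% of $A$ is nonzero.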

LncRNA functional similarity
There are multiple methods for expressing the similarity between lncRNAs, and one common method is based on their associations with diseases: lncRNAs that are associated with similar diseases are assumed to be functionally similar. In this experiment, we adopted the lncRNA functional similarity calculation method proposed by Chen et al. [48]. Suppose that two lncRNAs $l_i$ and $l_j$ are linked to disease categories $D(i)$ and $D(j)$, respectively. The similarity between a disease $d$ and a disease category $D(j)$ is
$$S(d, D(j)) = \max_{d_k \in D(j)} DS(d_k, d)$$
where $DS(d_k, d)$ represents the semantic similarity between diseases $d_k$ and $d$. Based on the semantic similarity between the diseases and the associations between the lncRNAs and disease categories, the formula for calculating the functional similarity of lncRNAs is as follows:
$$LF(l_i, l_j) = \frac{\sum_{d \in D(i)} S(d, D(j)) + \sum_{d \in D(j)} S(d, D(i))}{n + m}$$
where $n$ and $m$ denote the numbers of diseases in disease categories $D(i)$ and $D(j)$, respectively.
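The calculation can be sketched in a few lines of Python. This sketch assumes the commonly stated form of this similarity: each disease is matched to its best semantic counterpart in the other lncRNA's disease group, and the matches are averaged over both groups. The function names and the toy matrix `DS` are hypothetical and only illustrate the arithmetic.

```python
def group_similarity(d, group, DS):
    """Best semantic match between disease d and any disease in the group."""
    return max(DS[d][dk] for dk in group)


def functional_similarity(D_i, D_j, DS):
    """Average best-match similarity between two lncRNAs' disease groups."""
    if not D_i or not D_j:
        return 0.0  # assumption: no associated diseases means no similarity
    total = sum(group_similarity(d, D_j, DS) for d in D_i)
    total += sum(group_similarity(d, D_i, DS) for d in D_j)
    return total / (len(D_i) + len(D_j))


# Toy semantic similarity matrix over three diseases (indices 0, 1, 2).
DS = [[1.0, 0.5, 0.0],
      [0.5, 1.0, 0.2],
      [0.0, 0.2, 1.0]]
lf = functional_similarity([0, 1], [1, 2], DS)  # (0.5 + 1.0 + 1.0 + 0.2) / 4
```

The measure is symmetric in its two arguments, and an lncRNA compared with itself scores 1 whenever the diagonal of `DS` is 1.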

Disease semantic similarity
To compute the semantic similarity between diseases, their Medical Subject Headings (MeSH) descriptors can be used [49], and each disease can be denoted as a Directed Acyclic Graph (DAG) [50]. Specifically, the hierarchical relationship of a disease $d_i$ can be represented as
$$DAG(d_i) = (d_i, T(d_i), E(d_i))$$
where $T(d_i)$ represents $d_i$ and all its ancestor nodes, and $E(d_i)$ is the set of edges from ancestor nodes to descendant nodes. Computing disease semantic similarity can be divided into three steps. In the first step, for any disease $d_j$ in $DAG(d_i)$, its contribution to the semantic value of disease $d_i$ is computed using the following formula:
$$D_{d_i}(d_j) = \begin{cases} 1 & \text{if } d_j = d_i \\ \max\{\gamma \cdot D_{d_i}(d') \mid d' \in \text{children of } d_j\} & \text{if } d_j \neq d_i \end{cases}$$
where $\gamma$ is the semantic contribution factor, a hyperparameter set to 0.5. The second step computes the total semantic value $DV(d_i)$ of the disease:
$$DV(d_i) = \sum_{d \in T(d_i)} D_{d_i}(d)$$
The third step computes the semantic similarity between diseases $d_i$ and $d_j$ using the following formula:
$$DS(d_i, d_j) = \frac{\sum_{t \in T(d_i) \cap T(d_j)} \left( D_{d_i}(t) + D_{d_j}(t) \right)}{DV(d_i) + DV(d_j)}$$
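The three steps can be illustrated on a toy hierarchy. The sketch below assumes the usual DAG-based formulation: a disease contributes 1 to itself, each step up the hierarchy multiplies the contribution by γ (keeping the largest value when several paths exist), and similarity is the shared contribution divided by the two total semantic values. The `parents` mapping and the node names are hypothetical, not MeSH entries.

```python
def semantic_contributions(disease, parents, gamma=0.5):
    """Contribution of the disease itself and each ancestor to its semantics."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            value = gamma * contrib[node]
            # Keep the largest contribution when several paths reach a parent.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib


def semantic_similarity(d_i, d_j, parents, gamma=0.5):
    c_i = semantic_contributions(d_i, parents, gamma)
    c_j = semantic_contributions(d_j, parents, gamma)
    shared = set(c_i) & set(c_j)
    numerator = sum(c_i[t] + c_j[t] for t in shared)
    return numerator / (sum(c_i.values()) + sum(c_j.values()))


# Toy hierarchy: child -> list of parents.
parents = {"d1": ["p"], "d2": ["p"], "p": ["root"]}
sim = semantic_similarity("d1", "d2", parents)
```

Here each disease has total semantic value 1 + 0.5 + 0.25 = 1.75, the shared ancestors contribute 1.5 in total, and the similarity is 1.5 / 3.5 = 3/7.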

Gaussian interaction profile (GIP) kernel similarity for lncRNAs and diseases
Gaussian kernel similarity is a common similarity measure that maps data to a multidimensional space and computes the similarity between data points. The calculated lncRNA functional similarity and disease semantic similarity are both relatively sparse, so it is necessary to introduce other similarities to compensate for this deficiency. Therefore, we introduce GIP kernel similarity, which makes the similarity between nodes more pronounced and facilitates the prediction of associations between nodes. The GIP kernel similarity $LK(l_i, l_j)$ between lncRNAs $l_i$ and $l_j$ and $DK(d_i, d_j)$ between diseases $d_i$ and $d_j$ are calculated as follows:
$$LK(l_i, l_j) = \exp\left(-r_l \lVert IP(l_i) - IP(l_j) \rVert^2\right)$$
$$DK(d_i, d_j) = \exp\left(-r_d \lVert IP(d_i) - IP(d_j) \rVert^2\right)$$
where, as in reference [51], $IP(l_i)$ and $IP(l_j)$ represent the $i$-th and $j$-th rows of the known lncRNA-disease interaction matrix $A$, and $IP(d_i)$ and $IP(d_j)$ represent its $i$-th and $j$-th columns. $r_l$ and $r_d$ are the kernel bandwidth control parameters and are defined by the following formula:
$$r_l = \frac{1}{\frac{1}{N_l} \sum_{i=1}^{N_l} \lVert IP(l_i) \rVert^2}, \qquad r_d = \frac{1}{\frac{1}{N_d} \sum_{i=1}^{N_d} \lVert IP(d_i) \rVert^2}$$
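A short sketch of the kernel follows. It assumes the standard GIP form, in which the bandwidth is the reciprocal of the mean squared norm of the interaction profiles; the function name `gip_kernel` and the toy matrix are hypothetical.

```python
import math


def gip_kernel(profiles):
    """GIP kernel matrix for a list of interaction profiles (one per entity)."""
    n = len(profiles)
    mean_sq_norm = sum(sum(v * v for v in p) for p in profiles) / n
    bandwidth = 1.0 / mean_sq_norm  # assumes at least one nonzero profile
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-bandwidth * sq_dist)
    return kernel


# Toy association matrix: 3 lncRNAs (rows) x 3 diseases (columns).
A = [[1, 0, 1],
     [1, 0, 0],
     [0, 1, 0]]
LK = gip_kernel(A)                            # rows are lncRNA profiles
DK = gip_kernel([list(c) for c in zip(*A)])   # columns are disease profiles
```

Because the profiles come directly from $A$, an lncRNA or disease with no known associations has an all-zero profile, which is one reason this kernel is combined with other similarity sources.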

Integrated similarity networks for lncRNAs and diseases
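Previously, we introduced GIP kernel similarity to compensate for the sparsity of lncRNA functional similarity and disease semantic similarity. Based on these similarities, we calculate the integrated similarity matrices of lncRNAs and diseases using the following formulas:
$$IL(l_i, l_j) = \begin{cases} LF(l_i, l_j) & \text{if } l_i \text{ and } l_j \text{ have functional similarity} \\ LK(l_i, l_j) & \text{otherwise} \end{cases} \tag{10}$$
$$ID(d_i, d_j) = \begin{cases} DS(d_i, d_j) & \text{if } d_i \text{ and } d_j \text{ have semantic similarity} \\ DK(d_i, d_j) & \text{otherwise} \end{cases} \tag{11}$$
where $IL(l_i, l_j)$ represents the integrated similarity matrix between lncRNAs, and $ID(d_i, d_j)$ represents the integrated similarity matrix between diseases. To better utilize the integrated similarity matrices of lncRNAs and diseases, we use them to obtain their corresponding integrated similarity networks. We set two thresholds $\alpha$ and $\beta$ to calculate the similarity networks:
$$I_{net}(l_i, l_j) = \begin{cases} 1 & \text{if } IL(l_i, l_j) \geq \alpha \\ 0 & \text{otherwise} \end{cases}$$
$$D_{net}(d_i, d_j) = \begin{cases} 1 & \text{if } ID(d_i, d_j) \geq \beta \\ 0 & \text{otherwise} \end{cases}$$
where $I_{net}$ represents the network obtained from the integrated similarity matrix of lncRNAs: if the similarity value between $l_i$ and $l_j$ is greater than or equal to threshold $\alpha$, then $I_{net}(l_i, l_j) = 1$; otherwise $I_{net}(l_i, l_j) = 0$. $D_{net}$ denotes the network obtained from the integrated similarity matrix of diseases: if the similarity value between $d_i$ and $d_j$ is greater than or equal to threshold $\beta$, then $D_{net}(d_i, d_j) = 1$; otherwise $D_{net}(d_i, d_j) = 0$.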
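The integration and thresholding steps can be sketched as follows. Two points are assumptions rather than statements from the text: a pair is treated as "having" functional or semantic similarity when the corresponding entry is nonzero, and an edge is kept when the similarity is at least the threshold. The function names and the threshold value 0.5 are hypothetical.

```python
def integrate(primary, fallback):
    """Use primary[i][j] where it is nonzero, otherwise fallback[i][j]."""
    return [[p if p != 0 else f for p, f in zip(p_row, f_row)]
            for p_row, f_row in zip(primary, fallback)]


def threshold_network(similarity, cutoff):
    """Binary network with an edge wherever similarity >= cutoff."""
    return [[1 if s >= cutoff else 0 for s in row] for row in similarity]


# Toy 2 x 2 example: functional similarity LF and GIP kernel similarity LK.
LF = [[1.0, 0.0],
      [0.0, 1.0]]
LK = [[1.0, 0.4],
      [0.4, 1.0]]
IL = integrate(LF, LK)               # off-diagonal entries fall back to LK
I_net = threshold_network(IL, 0.5)   # hypothetical threshold alpha = 0.5
```

The same two functions apply to the disease side by passing the semantic similarity and the disease GIP kernel with the second threshold.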

LncRNA-disease heterogeneous network
We constructed a lncRNA-disease heterogeneous network $G_{net}$ that includes the lncRNA similarity matrix, the disease similarity matrix, and the known lncRNA-disease association matrix $A$:
$$G_{net} = \begin{bmatrix} IL & A \\ A^{T} & ID \end{bmatrix}$$
where $A^{T}$ represents the transpose of the lncRNA-disease interaction matrix.
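A block layout of this kind can be sketched with plain lists. The arrangement assumed here places the lncRNA similarity block top-left, the disease similarity block bottom-right, and the association matrix and its transpose off the diagonal, giving a square matrix of size $N_l + N_d$. The function name and the toy values are hypothetical.

```python
def heterogeneous_adjacency(lnc_sim, dis_sim, A):
    """Stack [[lnc_sim, A], [A^T, dis_sim]] into one square matrix."""
    A_T = [list(column) for column in zip(*A)]
    top = [sim_row + a_row for sim_row, a_row in zip(lnc_sim, A)]
    bottom = [at_row + sim_row for at_row, sim_row in zip(A_T, dis_sim)]
    return top + bottom


# Toy sizes: 2 lncRNAs and 1 disease.
lnc_sim = [[1.0, 0.4],
           [0.4, 1.0]]
dis_sim = [[1.0]]
A = [[1],
     [0]]
G = heterogeneous_adjacency(lnc_sim, dis_sim, A)
```

The result is symmetric whenever both similarity blocks are symmetric, so it can be treated as the adjacency matrix of an undirected graph.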

NAGTLDA
This section provides a detailed introduction to our proposed model, NAGTLDA, which accurately predicts lncRNA-disease associations. The workflow of NAGTLDA and the sequence of steps in the framework are shown in Fig. 1. The model framework comprises the following parts: (1) using NAFS to learn local-level node feature embeddings, (2) using SDNE to encode the structure of the networks, (3) using a Transformer model with a multi-head attention layer to learn global-level node feature embeddings, (4) using a Transformer model with two multi-head attention layers to learn embedding fusion at the global level, and (5) predicting the association score between diseases and lncRNAs.
Local-level node feature embedding (node-adaptive feature smoothing)
In recent years, GCN [52] has become very popular among graph neural networks (GNNs) because it can learn the features of all nodes in a graph based on both node features and graph structure. However, using GCN to aggregate multi-order neighbour information in large graphs leads to over-smoothing and incurs high computational cost and memory consumption. To address this issue, Zhang et al. [44] proposed NAFS, which aggregates and updates the features of nodes in a graph. Compared with GCN, NAFS addresses these limitations and requires no additional training, which significantly simplifies model training and avoids gradient vanishing and gradient explosion during backpropagation. Since our model uses NAFS for node feature embedding on all three graphs ($I_{net}$, $D_{net}$ and $G_{net}$), we use $G_{net}$, abbreviated as $G$, as an example. We denote the number of nodes in $G$ as $n$ and the number of edges as $m$. NAFS consists of four steps. The first step computes the over-smoothing distance:
$$D_i(k) = Dis\left([\hat{G}^{k} X]_i, [\hat{G}^{\infty} X]_i\right)$$
where $[\hat{G}^{k} X]_i$ is the $i$-th row of the matrix $\hat{G}^{k} X$, i.e. the smoothed representation of the $i$-th node, and $Dis(\cdot)$ is a distance function, which can be implemented using the Euclidean distance. $\hat{G} = \tilde{D}^{r-1} \tilde{G} \tilde{D}^{-r}$, where $\tilde{G}$ represents the adjacency matrix of the undirected graph with self-loops added, $\tilde{D}$ denotes the corresponding degree matrix, and $r$ is a hyperparameter of the model. $\hat{G}^{\infty}$ is calculated as follows:
$$\hat{G}^{\infty}_{i,j} = \frac{(d_i + 1)^{r} (d_j + 1)^{1-r}}{2m + n}$$
where $d_i$ represents the degree of node $i$. The second step computes the smoothing weight:
$$w_i(k) = \frac{e^{D_i(k)}}{\sum_{l=0}^{K} e^{D_i(l)}}$$
where $K$ represents the maximum number of smoothing steps. The third step computes the smoothing weight matrix:
$$W(k) = Diag(\phi(k)), \qquad \phi(k)[i] = w_i(k) \tag{18}$$
where $\phi(k) \in \mathbb{R}^{n}$ and $Diag(\cdot)$ represents a diagonal matrix. We denote the initial input feature representation as $X^{(0)}$. After $l$ rounds of smoothing, the node feature matrix $X^{(l)} = \hat{G} X^{(l-1)}$ builds on the features of the previous round of smoothing. After the maximum of $K$ rounds, we obtain a collection of feature matrices $X^{(0)}, X^{(1)}, X^{(2)}, \dots, X^{(K)}$. Finally, the smoothed feature $\hat{X}$ is computed as follows:
$$\hat{X} = \sum_{k=0}^{K} W(k) X^{(k)} \tag{19}$$
The definition of $X^{(0)}$ is as follows: In GCN, the symmetric normalized adjacency matrix is used, which corresponds to setting $r = 0.5$ in $\hat{G} = \tilde{D}^{r-1} \tilde{G} \tilde{D}^{-r}$. Using several values of $r$ results in a more diverse set of feature embeddings: the value of $r$ controls the normalization weight of each edge, so different $r$ values lead to distinct node feature embeddings for the same graph. We obtain a set of smoothed features $\hat{X}^{(1)}, \hat{X}^{(2)}, \dots$, one for each $r$ value, and combine them into
$$\hat{Z} = \bigotimes_{u} \hat{X}^{(u)}$$
Here, $\otimes$ represents a combination method, which can be the max function, concatenation, or the mean function.
First, we input the heterogeneous network, whose nodes correspond to lncRNAs and diseases. We compute a smoothing weight matrix $W(k)$ for each step $k$ according to Eq. (18) and then use a list of $r$ values $\{r_1, r_2, r_3, \dots, r_U\}$. For each $r$ value in the list, we derive a new node feature embedding of the network structure from Eq. (19), denoted as $\hat{X}^{(u)} \in \mathbb{R}^{(N_l+N_d) \times (N_l+N_d)}$. The feature embeddings obtained from all $r$ values are fused to obtain the final feature embedding $\hat{Z}_G \in \mathbb{R}^{(N_l+N_d) \times (N_l+N_d)}$. The final NAFS output is expressed as follows:
$$\hat{Z}_G = \bigotimes_{u=1}^{U} \hat{X}^{(u)}$$
where $U$ denotes the length of the $r$ list and $\otimes$ represents the fusion mode of the features (mean). Similarly, we use NAFS to obtain the lncRNA-integrated similarity network node features $\hat{Z}_L \in \mathbb{R}^{N_l \times N_l}$ and the disease-integrated similarity network node features $\hat{Z}_D \in \mathbb{R}^{N_d \times N_d}$. We then apply an affine transformation to the node features in $\hat{Z}_L$ so that $\hat{Z}_L$ and $\hat{Z}_D$ have the same dimension.
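The whole NAFS procedure is training-free and can be sketched in plain Python. The sketch below follows the published NAFS recipe as we understand it: normalize the self-looped adjacency matrix with exponent $r$, smooth the features $K$ times, measure each node's Euclidean distance to its stationary (over-smoothed) state, turn the distances into per-node softmax weights, and average the outputs over several $r$ values. The function names, the default $r$ values, and the use of one-hot input features are assumptions for illustration, not the authors' settings.

```python
import math


def matmul(P, Q):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def nafs(adj, X, r=0.5, K=3):
    """Node-adaptive smoothing of features X on a graph without self-loops."""
    n = len(adj)
    A = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A]  # degree including the self-loop
    G = [[deg[i] ** (r - 1) * A[i][j] * deg[j] ** (-r) for j in range(n)]
         for i in range(n)]
    total = sum(deg)  # equals 2m + n for an unweighted graph
    G_inf = [[deg[i] ** r * deg[j] ** (1 - r) / total for j in range(n)]
             for i in range(n)]
    X_inf = matmul(G_inf, X)  # stationary (over-smoothed) features
    steps = [X]
    for _ in range(K):
        steps.append(matmul(G, steps[-1]))
    smoothed = []
    for i in range(n):
        dist = [math.dist(Xk[i], X_inf[i]) for Xk in steps]
        top = max(dist)
        weights = [math.exp(d - top) for d in dist]  # numerically stable softmax
        z = sum(weights)
        smoothed.append([sum(w / z * Xk[i][c] for w, Xk in zip(weights, steps))
                         for c in range(len(X[0]))])
    return smoothed


def nafs_ensemble(adj, X, r_values=(0.3, 0.4, 0.5), K=3):
    """Mean fusion of NAFS outputs over a list of r values."""
    outputs = [nafs(adj, X, r, K) for r in r_values]
    return [[sum(out[i][c] for out in outputs) / len(outputs)
             for c in range(len(X[0]))] for i in range(len(X))]


# Toy path graph 0 - 1 - 2 with one-hot input features.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
Z = nafs(adj, X, r=0.0, K=2)
Z_mean = nafs_ensemble(adj, X)
```

Nodes whose features are still far from the stationary state after $k$ steps receive a larger weight for that step, which is how the method lets each node choose its own effective smoothing depth.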

Network structure encoding
We learn the structural encoding of the network as the structural inductive bias and pass it to the downstream Transformer module. We encode the network structure using the SDNE approach of Wang et al. [45] to further exploit the information in the network.
In the model, we encode the structure of the networks $I_{net}$ and $D_{net}$. Here we use $I_{net}$ as an example to illustrate the process of SDNE. SDNE is composed of an encoder and a decoder: the encoder maps the input network into a latent space with multiple nonlinear functions, and the decoder applies multiple nonlinear functions to reconstruct the network. In $I_{net} = (V, E)$, the adjacency matrix of the network is denoted by $M$, and $V$ denotes the set of lncRNA nodes within the network, where $|V| = N_l$. The mapping of the network is performed as follows:
$$y_i^{(1)} = \sigma\left(W^{(1)} M_i + b^{(1)}\right), \qquad y_i^{(k)} = \sigma\left(W^{(k)} y_i^{(k-1)} + b^{(k)}\right), \quad k = 2, \dots, K$$
where $M_i$ denotes the initial feature of the $i$-th lncRNA in the network, $\sigma(\cdot)$ denotes the activation function, $W^{(k)}$ and $b^{(k)}$ are the trainable parameters, and $K$ is the number of hidden layers of the encoder and decoder. Once $y_i^{(K)}$ is obtained, the reconstructed output $\hat{M}_i$ is obtained by reversing the calculation process of the encoder. To make SDNE capture the network structure more accurately, both first-order and second-order similarity are used to construct the SDNE loss function $L_{sdne}$, so that the error between the reconstructed network and the original network is smaller. In the loss, $\odot$ represents the Hadamard product, $M$ represents the adjacency matrix of the network, $M(i, j)$ represents the value in the $i$-th row and $j$-th column of the matrix, and $\alpha$ is a hyperparameter. $L_{reg}$ is a regularization term introduced to avoid overfitting.
We input a network $G = (V, E)$, where $V$ denotes the set of nodes and $E$ denotes the set of edges. The network structure is encoded following Eq. (23) and subsequently decoded by a decoding module using Eq. (26). Eq. (24) gives the first-order loss function, Eq. (25) the second-order loss function, and Eq. (27) the regularization function used to compute the loss of the reconstructed network structure. This comprehensive approach aims to enhance the accuracy of the encoded network structure. Finally, the result $y_i^{(K)}$ obtained from the encoder is output. $I_{net}$ and $D_{net}$ denote the lncRNA-integrated similarity network and the disease-integrated similarity network. Applying SDNE to $I_{net}$ and $D_{net}$ yields the structural encodings $M \in \mathbb{R}^{N_l \times n_p}$ and $D \in \mathbb{R}^{N_d \times n_p}$, where $n_p = K/2$ and $K$ denotes the number of hidden layers in the decoder and encoder. We combine $M$ and $D$ into a new network structure encoding, which serves as the structural inductive bias $SF$ in the fusion stage.
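For orientation, the loss terms referred to above can be written in the form used by the original SDNE formulation. This is a sketch under that assumption: the penalty matrix $B$ with its weight $\beta$, the decoder weights $\hat{W}^{(k)}$, and the regularization weight $\nu$ come from the original SDNE paper and are not named in the text here, so the exact weighting used in NAGTLDA may differ.

```latex
\begin{aligned}
L_{2nd} &= \left\lVert (\hat{M} - M) \odot B \right\rVert_F^2, \qquad
  B(i,j) = \begin{cases} 1 & \text{if } M(i,j) = 0 \\ \beta > 1 & \text{otherwise} \end{cases} \\
L_{1st} &= \sum_{i,j} M(i,j) \left\lVert y_i^{(K)} - y_j^{(K)} \right\rVert_2^2 \\
L_{reg} &= \frac{1}{2} \sum_{k=1}^{K} \left( \left\lVert W^{(k)} \right\rVert_F^2 + \left\lVert \hat{W}^{(k)} \right\rVert_F^2 \right) \\
L_{sdne} &= L_{2nd} + \alpha L_{1st} + \nu L_{reg}
\end{aligned}
```

The second-order term penalizes reconstruction errors on existing edges more heavily than on missing ones, while the first-order term pulls the embeddings of connected nodes together.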

Global-level embedding
In our model, we account for the limited information contained in the local-level node features. Therefore, we introduce a Transformer [53] module to learn global-level node features and to explore unknown associations between diseases and lncRNAs from a global perspective. The Transformer is increasingly used in the domain of graph neural networks and has significant implications for their future development. In NAGTLDA, we only need the Transformer encoder to learn the feature embedding of the global-level nodes. We take the node features $\hat{Z}_G$ of the heterogeneous network as input to the Transformer, which is first processed through the multi-head attention layer. The projection matrices of the attention heads are the parameters to be trained, and $n_{head}$ represents the number of attention heads. We obtain a set of attention heads and finally obtain the output $H$ of the multi-head attention by splicing ($\oplus$) the heads and applying the trainable parameter $W_H \in \mathbb{R}^{(N_l+N_d) \times n_h}$. The output of the multi-head attention is then propagated through a feedforward network, in which $\sigma(\cdot)$ represents a nonlinear activation function (LeakyReLU) and $i$ denotes the number of hidden layers. Given the input $H$, the feedforward network produces the output $X$.
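For reference, the encoder computation can be sketched with the standard scaled dot-product attention of the original Transformer. The symbols $W_h^{Q}$, $W_h^{K}$, $W_h^{V}$, the per-head dimension $d_k$, and the feedforward parameters $W_1$, $b_1$, $W_2$, $b_2$ are assumptions introduced for illustration; the sketch shows a single hidden layer and omits residual connections and layer normalization, and the matrix shapes used in NAGTLDA may differ.

```latex
\begin{aligned}
Q_h &= Z_G W_h^{Q}, \qquad K_h = Z_G W_h^{K}, \qquad V_h = Z_G W_h^{V} \\
head_h &= \mathrm{softmax}\!\left( \frac{Q_h K_h^{T}}{\sqrt{d_k}} \right) V_h, \qquad h = 1, \dots, n_{head} \\
H &= \left( head_1 \oplus head_2 \oplus \cdots \oplus head_{n_{head}} \right) W_H \\
X &= \sigma\left( H W_1 + b_1 \right) W_2 + b_2
\end{aligned}
```

Because every node attends to every other node, the attention matrix has $(N_l + N_d)^2$ entries, which is what gives the module its global view of the heterogeneous network.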

Global-level embedding fusion
We have acquired local-level and global-level embeddings. Because combining these embeddings by straightforward splicing or summing would be inefficient, we employ a Transformer decoder to carry out global-level node embedding fusion. The Transformer does not use a graph message-passing mechanism; we therefore introduce the structural inductive bias of the network into the Transformer to compensate for the missing message-passing mechanism, which yields excellent results for the model. Here, we employ two multi-head attention layers: the first handles the node embeddings, and the second incorporates the structural inductive bias of the network to learn the final node embedding representation.
First, we use the first multi-head attention layer to process the concatenation of the global-level embedding $X$ and the local-level embedding $Z_{LD}$. By applying the multi-head attention Eqs. (30), (31), and (32) along with the feedforward network Eq. (33), we obtain a new node embedding $X_F \in \mathbb{R}^{(N_l+N_d) \times n'_h}$. Then, we use the second multi-head attention layer to incorporate the structural inductive bias of the network. After concatenating the structural inductive bias $SF$ and the node embedding $X_F$, we similarly apply Eqs. (30), (31), and (32) for multi-head attention and Eq. (33) for the feedforward network to obtain a new node embedding $X_S$.
We utilized the rich information of the heterogeneous network and the topological structure of the integrated similarity networks of lncRNAs and diseases to learn node feature embeddings at both the local level and the global level. Simultaneously, we learned the structural information of the networks. Finally, we fused them using the Transformer structure to obtain the final node embedding representation.

Predicting the association score between lncRNAs and diseases
We express the final node embedding as $X_S = \begin{bmatrix} X_S^L \\ X_S^D \end{bmatrix}$, where $X_S^L \in \mathbb{R}^{N_l \times f}$ indicates the final node feature embedding of lncRNAs and $X_S^D \in \mathbb{R}^{N_d \times f}$ indicates the final node feature embedding of diseases. The lncRNA-disease interaction matrix $A$ is reconstructed using a bilinear decoder with a trainable parameter matrix $W_B$. We treat the lncRNA-disease link prediction task as a binary classification problem, so binary cross-entropy is selected as the loss function for association prediction. It is computed over the lncRNA-disease pairs $(i, j)$, where the sets of negative and positive pairs are represented by $I^-$ and $I^+$, respectively. Our model's overall loss function combines $L_{l\_p}$, the loss function of the reconstructed association matrix, with $L^1_{sdne}$ and $L^2_{sdne}$, the SDNE loss functions for the structures of the disease-integrated similarity and lncRNA-integrated similarity networks, respectively. We used the Adam optimizer [54] for the overall optimization of the model.
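A bilinear decoder and its cross-entropy loss can be sketched as below. The sketch assumes the common form in which the score of a pair is the sigmoid of the lncRNA embedding times $W_B$ times the disease embedding, and it averages the loss over the sampled pairs; both choices, as well as the function names and toy embeddings, are assumptions rather than the authors' exact definitions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bilinear_scores(X_L, W_B, X_D):
    """scores[i][j] = sigmoid(X_L[i] . W_B . X_D[j]) for every pair (i, j)."""
    XW = [[sum(x * w for x, w in zip(row, col)) for col in zip(*W_B)]
          for row in X_L]
    return [[sigmoid(sum(a * b for a, b in zip(xw, d))) for d in X_D]
            for xw in XW]


def bce_loss(scores, positives, negatives, eps=1e-12):
    """Mean binary cross-entropy over positive and negative index pairs."""
    loss = -sum(math.log(scores[i][j] + eps) for i, j in positives)
    loss -= sum(math.log(1.0 - scores[i][j] + eps) for i, j in negatives)
    return loss / (len(positives) + len(negatives))


# Toy embeddings with f = 2: two lncRNAs, one disease, identity W_B.
X_L = [[1.0, 0.0], [0.0, 1.0]]
X_D = [[1.0, 0.0]]
W_B = [[1.0, 0.0], [0.0, 1.0]]
scores = bilinear_scores(X_L, W_B, X_D)
loss = bce_loss(scores, positives=[(0, 0)], negatives=[(1, 0)])
```

In training, this prediction loss would be added to the two SDNE losses before each optimizer step; whether the terms are weighted is not stated in the text.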
To achieve an equal distribution of negative and positive samples during the training phase of our model, an equivalent quantity of negative data is randomly chosen to enter the training.The training process of NAGTLDA is shown in Algorithm 1.
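The decoding and loss steps described above can be illustrated with a minimal sketch. It assumes the common bilinear form sigmoid(x_l^T W_B x_d) for the decoder and a mean binary cross-entropy over the positive pairs and an equal number of randomly sampled unlabeled pairs; the exact formulas, the sigmoid, and all function names here are assumptions for illustration, not the authors' implementation.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bilinear_score(x_l, w_b, x_d):
    """Score one pair as sigmoid(x_l^T W_B x_d); the sigmoid is an assumption."""
    proj = [sum(x_l[a] * w_b[a][b] for a in range(len(x_l)))
            for b in range(len(x_d))]
    return sigmoid(sum(p * d for p, d in zip(proj, x_d)))


def sample_balanced_negatives(assoc, rng):
    """Return all positive pairs plus an equal number of random unlabeled pairs."""
    pos = [(i, j) for i, row in enumerate(assoc)
           for j, v in enumerate(row) if v == 1]
    neg = [(i, j) for i, row in enumerate(assoc)
           for j, v in enumerate(row) if v == 0]
    return pos, rng.sample(neg, len(pos))


def bce_loss(pairs_pos, pairs_neg, x_lnc, x_dis, w_b, eps=1e-12):
    """Mean binary cross-entropy over positive and sampled negative pairs."""
    total = 0.0
    for i, j in pairs_pos:
        total -= math.log(bilinear_score(x_lnc[i], w_b, x_dis[j]) + eps)
    for i, j in pairs_neg:
        total -= math.log(1.0 - bilinear_score(x_lnc[i], w_b, x_dis[j]) + eps)
    return total / (len(pairs_pos) + len(pairs_neg))
```

In a full model this term would be added to the two SDNE structural losses and minimized with Adam; how the three terms are weighted is not specified in this sketch.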

Experimental setting
During our experiments, we employed 5-fold cross-validation (5-CV) to test the performance of our proposed model. We partitioned the disease-lncRNA pairs into five equal subsets and, in each of the five iterations, used four subsets for training and one for testing. In each round, all known associations in the test set were removed (masked), and the trained model was then evaluated on the test samples. As performance evaluation metrics, we adopted AUPR (area under the precision-recall curve) and AUC (area under the receiver operating characteristic curve) as the primary indicators. Additionally, we considered five auxiliary metrics: recall, accuracy (ACC), F1-score, precision (Prec.), and specificity (Spec.). The detailed results of the 5-CV experiment are presented in Table 1. Our model achieved an average accuracy of 0.8785 and an average recall of 0.9088 on the experimental dataset. The average specificity and precision reached 0.8483 and 0.8578, respectively, while the average F1-score reached 0.882. The ROC and PR curves of our model are shown in Fig. 2; the average AUC and AUPR were 0.9531 and 0.9537, respectively. The results of the 5-CV experiment demonstrate the strong performance of our proposed model in disease-lncRNA interaction prediction tasks.

The model includes several hyperparameters: the final embedding dimension (dim), maximum smoothing steps (k), learning rate (lr), encoding dimension for SDNE (nhid), number of Transformer layers (L1 and L2), number of attention heads for multi-head attention (Head1 and Head2), r-value for NAFS, and weight decay for the optimizer. The hyperparameter settings are presented in Table 2. The optimal parameter values are shown in bold and were chosen based on the model AUC.
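As a reference for how the metrics named above are usually computed, the following is a minimal sketch in plain Python. The fold-splitting helper, the function names, and the 0/1 label encoding are assumptions for illustration; AUC is computed here as the probability that a random positive outranks a random negative, which is one standard definition rather than necessarily the routine the authors used.

```python
import random


def k_fold_indices(n_samples, k, rng):
    """Shuffle sample indices and deal them into k near-equal folds."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    return [idx[f::k] for f in range(k)]


def confusion_metrics(y_true, y_pred):
    """Accuracy, recall, specificity, precision and F1 from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


def roc_auc(y_true, scores):
    """AUC as P(score of a positive > score of a negative); ties count 1/2."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a small check but would normally be replaced by a rank-based routine on full datasets.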

Parameter analysis
During hyperparameter tuning, we found that certain parameter values have a noticeable impact on model performance. For instance, we analyzed the dimension of the final node features, as shown in Fig. 3. We compared different dimension values (dim ∈ {32, 64, 128, 256, 512}) and found that the AUC and AUPR values are highest when dim = 64. Selecting an appropriate dimension to represent node features is crucial. If the dimension is too small, nodes may not be clearly distinguishable. However, if the dimension is too large, the embedding can carry a significant amount of redundant information. Therefore, the embedding dimension is an important hyperparameter for the model.
Then, we analyzed the maximum number of smoothing steps in NAFS, as shown in Fig. 4. The maximum number of smoothing steps determines how many hops of neighbours are aggregated, which is equivalent to aggregating multi-order neighbours. We found that the AUC and AUPR values are highest when hops = 7: both metrics increase as the number of hops rises to 7 and decrease beyond it. After each smoothing step, the node features contain all the information from the previous steps, so the number of smoothing steps is also very important for learning the feature embedding.
In our model, we introduced the Transformer module, which includes a multi-head attention mechanism that provides a global perspective and enables global-level embedding learning. We used two instances of the Transformer module in our model, and we found that different combinations of layer numbers (L1 and L2) have a significant impact on the model's performance. Fig. 5a shows how different layer numbers affect the model's AUC, while Fig. 5b illustrates the impact of different values of L1 and L2 on AUPR. The highest AUC value is achieved when (L1, L2) is set to (10, 20), while the highest AUPR value is achieved when it is set to (15, 10). Additionally, different combinations of the numbers of attention heads, Head1 and Head2, also affect the predictive performance of the model. As depicted in Fig. 6a, the combination of Head1 and Head2 influences the AUC values, with the highest value observed at (8, 64). Fig. 6b shows that the highest AUPR value is also achieved when (Head1, Head2) is (8, 64).
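To make the role of the maximum number of smoothing steps concrete, here is a minimal sketch of multi-hop feature smoothing. It assumes the normalized adjacency D^(r-1) (A + I) D^(-r) with self-loops and the recursion X^(k) = Â X^(k-1); the node-adaptive weighting that NAFS applies across hops is deliberately omitted, and the function names are hypothetical.

```python
def normalize_adjacency(adj, r=0.5):
    """Return D^(r-1) (A + I) D^(-r); self-loops and this form are assumptions."""
    n = len(adj)
    a = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    deg = [sum(row) for row in a]
    return [[deg[i] ** (r - 1) * a[i][j] * deg[j] ** (-r) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain dense matrix product of two list-of-lists matrices."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def smooth_features(adj, x, max_hops, r=0.5):
    """Return [X^(0), ..., X^(K)] where X^(k) = A_hat @ X^(k-1)."""
    a_hat = normalize_adjacency(adj, r)
    hops = [x]
    for _ in range(max_hops):
        hops.append(matmul(a_hat, hops[-1]))
    return hops
```

Each additional hop mixes in features from one more order of neighbours, which is consistent with performance first improving and then degrading as smoothing goes too far.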

Performance comparison with different ratios
The proportion of negative to positive samples in each fold of cross-validation can also affect the model's performance. Therefore, we set the ratio of positive samples to negative samples in each fold to one of {1:1, 1:5, 1:10, random}. The detailed results are presented in Fig. 7.
Fig. 3 The effect of different embedding dimensions on the AUC and AUPR of NAGTLDA
Fig. 4 The effect of different maximal smoothing steps on the AUC and AUPR of NAGTLDA

We can observe that when the ratio is 1:1, indicating balanced positive and negative samples, the AUC and AUPR values are the highest at 0.9531 and 0.9537, respectively, but the corresponding accuracy is the lowest. When the ratio is 1:5, the AUC and AUPR values are slightly lower than those at 1:1, but the accuracy is slightly higher. When the ratio is 1:10, the AUC value is the lowest, but the accuracy is higher than for the previous ratios. When the ratio is set to random, the AUC value ranks third and the AUPR value is the lowest, but the accuracy is the highest at 0.9783. We speculate that these results are due to the low proportion of positive samples in the experimental dataset. Balancing the positive and negative samples in each fold yields the smallest quantity of training data per fold and hence the lowest accuracy. As the proportion of negative samples increases, the quantity of training data in each fold grows and the accuracy rises accordingly.
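One further factor may be worth keeping in mind when reading the accuracy values above: accuracy depends on the class ratio itself. The short sketch below, with a hypothetical helper name, computes the accuracy of a trivial classifier that always predicts the majority class. It gives 0.5000 at 1:1, 0.8333 at 1:5 and 0.9091 at 1:10. For D1 (240 × 412 = 98,880 candidate pairs, 2,697 of them positive), treating every unlabeled pair as a negative would put this baseline at about 0.973; whether the "random" setting corresponds to that case is an assumption, not something stated in the text.

```python
def majority_baseline_accuracy(n_pos, n_neg):
    """Accuracy of a trivial classifier that always predicts the larger class."""
    return max(n_pos, n_neg) / (n_pos + n_neg)


if __name__ == "__main__":
    for n_neg in (1, 5, 10):
        acc = majority_baseline_accuracy(1, n_neg)
        print(f"positive:negative = 1:{n_neg} -> baseline accuracy {acc:.4f}")
```

Because this baseline rises with the share of negatives, AUC and AUPR are the more informative metrics when comparing the different ratios.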

Performance comparison with other methods
In our experiments, we compared our model with six state-of-the-art computational methods on the benchmark dataset D1 using 5-CV. The methods are as follows:
• HGATLDA (2022) [55]: A meta-path-based heterogeneous graph attention network framework that predicts interactions between diseases and lncRNAs by constructing disease, lncRNA, and gene heterogeneous networks.
• SFGAE (2022) [56]: A graph autoencoder is used for node feature learning, and self-feature representations of miRNAs and diseases are constructed for miRNA-disease association prediction.
• VGAELDA (2021) [57]: An end-to-end computational model based on a variational autoencoder and a graph autoencoder that predicts the relationships between diseases and lncRNAs.
• LAGCN (2020) [58]: A layer-attention graph convolutional network that integrates multi-source similarities into a heterogeneous network for drug-disease association prediction.
• LDA-LNSUBRW (2020) [59]: A computational method based on unbalanced bi-random walk and linear neighborhood similarity for association prediction between diseases and lncRNAs.
• CNNLDA (2019) [29]: A dual convolutional neural network model with an attention mechanism that integrates multiple sources of data to uncover the associations between diseases and lncRNAs.
The benchmark dataset D1 was compiled from Lnc2Cancer [16], LncRNADisease [17] and GeneRIF [47]. It was sourced from the previous research by Fu et al. [46] on lncRNA-disease association prediction. The dataset comprises 240 lncRNAs, 412 diseases, and 2,697 experimentally validated lncRNA-disease interactions. The semantic similarity data for all diseases are obtained from MeSH.
In the benchmark dataset D1 experiments, we compared the models using two evaluation metrics, AUC and AUPR. The experimental results are presented in Table 3. Figure 8 shows the ROC and PR curves of all models obtained through the 5-CV experiments. It is evident from the figure that NAGTLDA outperforms the other models. To highlight the performance disparity between NAGTLDA and existing state-of-the-art methods, we conducted a significance analysis of their AUC values, shown in Fig. 9 (* denotes P < 0.05, ** denotes P < 0.01, *** denotes P < 0.001). Notably, the differences between NAGTLDA and the other methods are consistently significant, ranging from P < 0.05 to P < 0.001. This performance improvement is meaningful for uncovering unknown lncRNA-disease associations. Hence, we can infer that our proposed model demonstrates excellent performance and serves as an effective computational approach for predicting disease-lncRNA associations.
Fig. 7 The effect of different ratios of positive and negative samples on the performance of NAGTLDA
Fig. 8 ROC curve and PR curve of the proposed method and six baselines under the 5-CV settings
Fig. 9 Significance analysis of other models with NAGTLDA on the D1 dataset

Compared with these state-of-the-art methods, our model exhibits a significant performance advantage, as confirmed in the experiments above. The enhancement in performance can be attributed to the following contributions. First, NAFS is utilized to learn local features of nodes, which simplifies the model training process and enhances effectiveness. Second, the incorporation of network structure encoding improves the learning of graph node information. Lastly, the Transformer architecture allows the model to learn global information of nodes in the graph. The global and local features are then adaptively fused using a multi-head attention approach, resulting in comprehensive feature information for diseases and lncRNAs.

Performance on other datasets
To further validate the performance and generalization ability of the NAGTLDA model, we performed experiments on a larger lncRNA-disease association dataset D2 and a miRNA-disease association dataset D3, as shown in Table 4.
• D2: We screened data from databases of known lncRNA-disease associations, including LncRNADisease v2.0 [60] and Lnc2Cancer v3.0 [61], known lncRNA-miRNA associations from Encori [62] and NPInter V4.0 [63], and known miRNA-disease associations from HMDD v3.2 [64]. All disease names were converted to standard MeSH disease terms to facilitate the calculation of semantic similarity between the diseases. After removing redundant data, the final merged dataset contained 861 lncRNAs, 432 diseases, and 4516 known lncRNA-disease associations. The features used to compute the semantic similarity of diseases are obtained from MeSH.
• D3: The known miRNA-disease association data were downloaded from the HMDD v3.2 database [64]; after screening, we obtained 788 miRNAs, 374 diseases, and 8968 corresponding known associations. The features used to compute the semantic similarity of diseases are obtained from MeSH.
We conducted 5-fold cross-validation experiments on the D2 and D3 datasets, and the results are presented in Table 5. Comparing the results on the original dataset with those on D2, we observed that the model performs better on D2. This improvement may be attributed to the Transformer structure in NAGTLDA: the Transformer was originally designed for large-scale natural language processing tasks and tends to benefit from larger datasets.
On the D3 dataset, we achieved AUC and AUPR values exceeding 0.94, while the F1-score reached 0.8746. These outcomes indicate that our model possesses strong generalization capabilities. It not only performs well in predicting lncRNA-disease associations, which is the primary focus of our study, but also demonstrates high performance on other non-coding RNA datasets.
We established independent validation sets to assess the performance of our model, following the methodology outlined by Fu et al. [65]. For the D1 dataset, which contains 2697 positive samples, we first selected 20% of the positive samples and the same number of negative samples to construct a balanced independent validation set (B-validation set). The remaining samples were used for training. Subsequently, we randomly extracted 20% of the samples from the D1 dataset to create an unbalanced independent validation set (Unb-validation set), while the remaining samples served as the training set. The experimental results on these two independent validation sets are summarized in Table 6.
We compared the model's performance on the two independent validation sets with its performance on the benchmark dataset. There was a decrease in performance on the independent validation sets in terms of the two primary metrics, AUC and AUPR. Despite this decrease, the model still achieved relatively good results. Furthermore, the AUC and AUPR on the unbalanced independent validation set were slightly lower than those on the balanced validation set. This trend was observed in both balanced and unbalanced datasets, suggesting the need to explore strategies for choosing an optimal ratio of positive and negative samples so that the model is trained more comprehensively.
After comparing NAGTLDA with other state-of-the-art models in the previous experiments on the D1 dataset, we extended our evaluation to two larger datasets, D2 and D3. We analyzed the significance of their AUC values, as illustrated in Figs. 10 and 11, to assess computational efficiency and scalability across models. NAGTLDA differed significantly from the other models on both datasets, most clearly on the D2 dataset, where the significance compared to the other state-of-the-art models reached P < 0.001.
The reasons for the strong scalability of our model are as follows: (1) Our model applied SDNE to learn the …

However, our proposed model has some limitations on large datasets. Large datasets are commonly imbalanced between positive and negative samples, which requires introducing multi-source features to compensate for the sparsity of positive samples. Moreover, the model has many hyperparameters, and applying it to large datasets may lead to overfitting because of the large number of parameters.
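The construction of the two validation sets can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: it assumes samples are (lncRNA index, disease index) pairs, that the unbalanced set is drawn uniformly from the pooled positives and negatives, and the function names are hypothetical.

```python
import random


def balanced_holdout(pos, neg, frac, rng):
    """Hold out frac of the positives plus the same number of negatives."""
    n = round(frac * len(pos))
    val = rng.sample(pos, n) + rng.sample(neg, n)
    held = set(val)
    train = [p for p in pos + neg if p not in held]
    return val, train


def unbalanced_holdout(pos, neg, frac, rng):
    """Hold out frac of all samples, keeping the natural class ratio."""
    pool = pos + neg
    val = rng.sample(pool, round(frac * len(pool)))
    held = set(val)
    train = [p for p in pool if p not in held]
    return val, train
```

With 2697 positives and frac = 0.2, the balanced variant would hold out 539 positive and 539 negative pairs under these assumptions.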

Feature visualization
To show the effectiveness of our proposed model more concretely and graphically, we visualize the lncRNA-disease pair features learned by the model. We used t-SNE [66] to reduce the dimensionality of the lncRNA-disease pair features and plotted them in the two-dimensional plane to compare the learned pair features with the original pair features. Fig. 12 shows the original pair features (left) and the learned pair features (right). In the visualization, negative and positive samples are distinguished by dots of different colors, and we can observe that the positive and negative lncRNA-disease pairs learned by NAGTLDA are each more concentrated and more distinguishable than in the original features. This also indicates that our model is meaningful and interpretable for disease and lncRNA feature learning.

Ablation experiments
To assess the influence of each module on the model performance and its importance, three sets of ablation experiments were performed for validation.
In the first set of ablation experiments, we removed one module at a time from the initial model to construct the comparison models. The results are presented in Fig. 13 and Table 7; the original NAGTLDA model achieves better results than all comparison models. For example, NAGTLDA outperforms the remove disease-SDNE variant by 0.0181 in AUC and 0.0133 in AUPR. We observe that encoding the network structure information has the most significant impact on overall model performance. Thus, while acquiring node-level information within the network is important, a comprehensive understanding of the network's structural information is also vital. The overall performance of every model formed by removing a module is lower than that of the original model, which demonstrates the effectiveness of using the Transformer layer for global-level embedding, NAFS for local-level embedding, and SDNE for network structure encoding.
The second set of ablation experiments was conducted by replacing the method used for local-level embedding in the model with the classical graph neural networks GCN and GAT to construct the comparison models NAGTLDA_gcn and NAGTLDA_gat. As shown in Table 8 and Fig. 14, NAGTLDA performs better than the variant models. Specifically, NAGTLDA is 0.0106 higher than NAGTLDA_gcn in AUC, 0.0079 higher than NAGTLDA_gat in AUPR, and 0.0158 higher than NAGTLDA_gcn in accuracy. In the third set of ablation experiments (Table 9), six of the seven evaluation metrics are the highest when the mean operation is used.

Case study
In the previous sections, we tested and confirmed the effectiveness of NAGTLDA. Now, we evaluate NAGTLDA's ability to uncover unknown relationships between diseases and lncRNAs. We chose four common diseases from the dataset as case studies: prostate cancer, colon cancer, breast cancer, and colorectal cancer. We trained the model with the 2697 observed lncRNA-disease associations and then made predictions for unknown potential associations. We extracted the top 15 candidate lncRNAs for each disease and validated the results using three benchmark databases: LncRNADisease v2.0 [60], Lnc2Cancer 3.0 [61], and MNDR v3.1 [67].
The exact cause of colon cancer is still unknown, but studies have shown that the risk of developing the disease increases with age, obesity, and cancer in other parts of the body. As research continued, researchers found that colon cancer is closely linked to several lncRNAs. For example, the binding of CYTOR to its corresponding protein can contribute to the metastasis of colon cancer [68], and HOXB-AS3 expression can inhibit the growth of colon cancer [69]. The experimental outcomes are presented in Table 10, where 14 of the top 15 candidate lncRNAs have been confirmed.
Prostate cancer is the most prevalent malignancy of the male urological system and is especially common in older men, but its etiology has not yet been fully identified. Researchers have found that prostate cancer is closely related to the expression of lncRNAs. For example, the expression of MAGI2-AS3 and MEG3 in …

Breast cancer is the most common cancer among women. According to research, obesity, excessive alcohol consumption, and overnutrition all increase the incidence of breast cancer, but thus far, medical researchers have not found the exact cause of the cancer. With the continuing development of bioclinical technology, a growing number of lncRNAs related to breast cancer have been discovered. For example, the distant metastasis-free survival, overall survival, and progression-free survival of breast cancer patients are strongly associated with high expression of BCAR4, LUCAT1, and TINCR [73][74][75]. LINC00511 binds to the MMP13 protein to promote breast cancer cell migration and proliferation [76]. We used breast cancer as the third disease in the case study, and the experimental outcomes are presented in Table 12. All of the top 15 candidate lncRNAs have been validated by the relevant literature.
Colorectal cancer is the third most common malignancy in the world, and its incidence is relatively similar in men and women. Researchers have found through numerous clinical trials that ITGB8-AS1 combined with the corresponding signals can contribute to the growth and metastasis of colorectal cancer [77] and that GAS5 and YAP phosphorylation and degradation interact to inhibit the development of colorectal cancer [78]. We used it as the fourth disease in our case study, and the experimental outcomes are presented in Table 13, where 13 of the top 15 candidate lncRNAs have been validated by the relevant literature.
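The candidate-selection step of the case study can be sketched as below: for one disease, rank all lncRNAs that have no known association with it by predicted score and keep the top k. The function name, the list-of-lists score matrix, and the toy identifiers used in any example are hypothetical.

```python
def top_k_candidates(scores, known, disease_idx, lnc_names, k=15):
    """Rank lncRNAs without a known link to the disease by predicted score."""
    candidates = [
        (scores[i][disease_idx], lnc_names[i])
        for i in range(len(lnc_names))
        if known[i][disease_idx] == 0  # skip associations already in the data
    ]
    candidates.sort(key=lambda item: item[0], reverse=True)
    return [name for _, name in candidates[:k]]
```

The returned names would then be checked manually against external databases, as described above for LncRNADisease v2.0, Lnc2Cancer 3.0, and MNDR v3.1.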

Discussion
In the present paper, we designed the NAGTLDA computational model to infer unknown interactions between lncRNAs and diseases. Based on the experimental results, our model demonstrates promising performance, particularly in handling large datasets. The high scalability across datasets of varying sizes can be ascribed to the use of the graph Transformer architecture for extracting feature representations. This architecture possesses a highly expressive and adaptive learning capability, enabling it to learn diverse networks effectively. However, our proposed model and the current study have some limitations. The limitations of our model are as follows: (1) The main framework of our model is built upon the Transformer architecture, which requires considerable computational power during training, particularly in practical applications involving large datasets. (2) The numerous hyperparameters necessitate meticulous optimization and tuning, which increases the complexity of the training process. (3) Our model also relies on the initial similarity features of the nodes, which are calculated from the association matrix. The field of lncRNA-disease association prediction also has limitations at present: (1) There are no true negative samples in the experimental data; biological studies look for true positive samples and pay little attention to negative samples, so a negative sample may be a true negative or an undetected false negative. (2) The results of computational modeling do not correlate very well with biological experiments; better integration of the two would make the results more interpretable. In future research, we can start by studying the dataset and exploring how to better represent the correlations between entities, which should lead to more accurate discovery of unknown associations. In addition, as medical science and technology continue to advance, more unknown lncRNAs, represented as isolated nodes, are expected to be discovered. Moving forward, there is a pressing need to develop more comprehensive models that can accurately predict the associations between these isolated nodes and experimentally verified disease nodes.

Conclusions
In the model, we first constructed a heterogeneous network consisting of diseases and lncRNAs, an integrated similarity network for diseases and an integrated similarity network for lncRNAs, and used NAFS to perform node-level embedding for each of the three networks. We also adopted SDNE to encode the structural information of the networks in order to use the constructed networks more effectively. We then introduced the Transformer module for global-level embedding to explore potential unknown associations in the dataset, and used the Transformer fusion mechanism with two levels of attention to fuse the learned embeddings and the network topology at the global level. We performed embedding learning on the network information from both local and global perspectives so that potential associations can be better identified. Finally, a bilinear decoder takes the node embedding representations of diseases and lncRNAs as input for lncRNA-disease association prediction. We also evaluated the performance of our model, and the outcomes of the 5-CV experiments and the comparison with other baseline models confirm its excellent performance.

In the case study, NAGTLDA successfully predicted associations that were previously unknown in the dataset, such as NEAT1-colon cancer, SOX2-OT-prostate cancer, and WT1-AS-colorectal cancer. He et al. [79] investigated the function of NEAT1 in colon cancer and found that the expression of NEAT1 was significantly elevated in colon cancer cells in their experiments, which indicated that NEAT1 indirectly promotes the occurrence of colon cancer. Song et al. [80] demonstrated that SOX2-OT inhibits the proliferation and metastasis of prostate cancer cells by interacting with other non-coding RNAs. This discovery provides a new therapeutic approach for the treatment of prostate cancer. Zhang et al. [81] demonstrated experimentally that WT1-AS was closely associated with overall survival in colorectal cancer. The correlation between WT1-AS and colorectal cancer was demonstrated through clinicopathological features and data modeling analysis, and WT1-AS can be used as a biomarker and therapeutic target for colorectal cancer prognosis. These findings indicate that our proposed model can help identify new therapeutic strategies for diseases and provides a solid foundation for biological experiments and clinical practice.

Fig. 1
Fig. 1 The NAGTLDA workflow. Step 1: Construct the integrated similarity networks, extract the local features of the heterogeneous network and the integrated similarity networks using NAFS, and encode the structural information of the integrated similarity networks using SDNE. Step 2: Learn global information of heterogeneous network nodes with the Transformer architecture. Step 3: Adaptively fuse the local information of nodes, the global information and the structural encoding of the network with the Transformer architecture. Step 4: Predict associations using the bilinear decoder

Fig. 2
Fig. 2 ROC curves and PR curves of NAGTLDA in 5-CV

Fig. 10
Fig. 10 Significance analysis of other models with NAGTLDA on the D2 dataset

Fig. 11
Fig. 11 Significance analysis of other models with NAGTLDA on the D3 dataset

Fig. 12
Fig. 12 Comparison of visualization features of lncRNA-disease pairs obtained by NAGTLDA and the original

NAGTLDA achieves the highest F1-score compared with NAGTLDA_gcn and NAGTLDA_gat, and because the F1-score is a benchmark indicator of the comprehensive ability of a model, the original model is the better choice. Combining the outcomes of the first set of ablation experiments and the present set of experiments, it can be concluded that using NAFS for embedding learning of node features is an efficient learning method, which also demonstrates the effectiveness and efficiency of using NAFS in the whole model.

The third set of ablation experiments is conducted for NAFS. We input a set of r values to obtain a set of different node feature representations, which can then be processed in different ways. NAGTLDA_concat, NAGTLDA_max and NAGTLDA_simple represent the use of the concatenate, max and simple operations, respectively. The simple operation means inputting only one r value to obtain one experimental result. The detailed experimental outcomes are presented in Fig. 15 and Table 9.

Fig. 15
Fig. 15 Comparison results of NAGTLDA, NAGTLDA_concat, NAGTLDA_max and NAGTLDA_simple

… and $l_2$ is linked to disease category $D(j) = \{d_{j1}, d_{j2}, d_{j3}, \ldots, d_{jm}\}$. The formula for calculating the similarity score between disease $d_k \in D(i)$ and disease category $D(j)$ provided here is:

Table 3
Performance comparison between our proposed method and six baselines under 5-CV settings

It can be observed that our proposed NAGTLDA model achieves the highest AUC and AUPR values. This improvement can be attributed to the use of a Transformer for global learning when learning node features. NAGTLDA outperforms LDA-LNSUBRW by 8.92% in AUC and 5.51% in AUPR.

Table 4
Details about datasets

Table 5
NAGTLDA performance under D1 and D2 datasets

Table 6
Performance of NAGTLDA on D1 dataset and independent validation set

Table 7
Performance between NAGTLDA and multiple variant models

Table 8
Performance of NAGTLDA based on different local-level embedding methods

Fig. 14 Comparison results of NAGTLDA, NAGTLDA_gcn and NAGTLDA_gat

Table 9
Performance of NAFS based on different fusion methods

We used prostate cancer as the second disease in the case study, and the experimental outcomes are presented in Table 11. Thirteen of the top 15 candidate lncRNAs we identified have been confirmed by the relevant literature.

Table 10
The top 15 predicted lncRNAs associated with colon cancer

Table 11
The top 15 predicted lncRNAs associated with prostate cancer

Table 12
The top 15 predicted lncRNAs associated with breast cancer

The incidence of colorectal cancer is relatively similar in men and women. The majority of the population suffers from the disease due to lifestyle habits, and a very small percentage is due to genetic factors. Colorectal cancer ranks second in the number of deaths caused by malignant tumors. Researchers have found through numerous clinical trials that ITGB8-AS1 combined with the corresponding signals can contribute to the growth and metastasis of colorectal cancer [77].

Table 13
The top 15 predicted lncRNAs associated with colorectal cancer